I. Introduction

Amid the Coronavirus Disease (COVID-19) pandemic in 2020, governments around the world developed responses to aid their citizens and mitigate the spread of Severe Acute Respiratory Syndrome Coronavirus 2. This study aims to predict the cumulative number of confirmed COVID-19 cases per 10,000 individuals in different countries on the 23rd of October 2020, using the government responses in these countries on the 15th of June 2020 together with other World Bank indicators of interest. It builds upon the findings of the previous research (Project 2) by considering a nonlinear model.

The data used in this study were obtained from The Humanitarian Data Exchange data portal and include the total population of each country in 2019 [1], the cumulative number of confirmed cases of COVID-19 in different countries [2] on the 23rd of October 2020, the Stringency and Economic Support indices on the 15th of June 2020 [3], the total smoking prevalence among people aged 15 and above in 2016 [4], the number of nurses and midwives per 1,000 people in 2018 [5], and the percentage of the population aged 15 to 64 in 2019 [6].

When comparing the number of infected individuals across countries, the population size of each country needs to be considered. The study therefore looks at the cumulative number of confirmed COVID-19 cases on the 23rd of October 2020, per 10,000 individuals, calculated as \(\frac{\text{cumulative cases in the country}}{\text{total population of the country}} \cdot 10{,}000\). This variable and date are the same as those in Project 2, since the aim is to analyse the same cumulative case data with a different approach.
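The per-10,000 scaling above can be sketched in R as follows. The function name `cases_per_10000` and the input numbers are illustrative assumptions, not values taken from the study's data.

```r
# Cumulative confirmed cases per 10,000 inhabitants.
# (Hypothetical helper; the name is not from the original analysis.)
cases_per_10000 <- function(cumulative_cases, population) {
  stopifnot(all(population > 0))
  cumulative_cases / population * 10000
}

# Made-up inputs: 500 cumulative cases in a population of 2,000,000.
example_rate <- cases_per_10000(500, 2e6)
print(example_rate)
```

Because the function is vectorised, it could be applied to whole columns of a country-level data frame in one call.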

Project 2 showed that the Stringency Index and the Economic Support Index, both continuous and measured on the 15th of June 2020, are related to the cumulative number of confirmed COVID-19 cases per 10,000 on the 23rd of October 2020. These two indices are therefore included in this study to quantify the government response to the outbreak. The Stringency Index accounts for closure, containment, and public health measures. The Economic Support Index accounts for the economic response taken by governments. Note that the data provide separate government responses for regions within certain countries, for example the United States of America. Since this study looks at each country as a whole, the government response of such a country on the 15th of June 2020 is taken to be the average over all its regions on that day. As in Project 2, the data from the 15th of June 2020 are used, since the same relationship between cumulative confirmed cases and government response is being analysed, but with a different approach and with additional predictor variables.
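The regional averaging described above might look like the following sketch. The data frame `regional`, its column `Region`, and all values are made up for illustration, and the layout of the real source file is an assumption.

```r
# Hypothetical regional records for a single day; values are made up.
regional <- data.frame(
  Country = c("A", "A", "A", "B"),
  Region = c("A1", "A2", "A3", "B1"),
  Stringency_Index = c(60, 70, 80, 50),
  Economic_Support_Index = c(50, 50, 75, 25)
)

# One row per country: the mean of each index over its regions.
country_level <- aggregate(
  cbind(Stringency_Index, Economic_Support_Index) ~ Country,
  data = regional,
  FUN = mean
)
print(country_level)
```

A country with a single record simply keeps its own values, so the same step can be applied to every country.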

Three other predictor variables were chosen for this study. The total smoking prevalence among people aged 15 and above is considered, since smoking brings a possibly contaminated surface (hand, ashtray, etc.) into contact with a mucous membrane (the mouth), which is how SARS Coronavirus 2 mainly spreads [7]. This variable has no data for ages below 15, but that may not be an issue, since smoking is less common below that age. The most recent data available are from 2016, so that is what is analysed. The number of nurses and midwives per 1,000 people in 2018 (most recent data available) is also considered, since nurses are in contact with sick or infected individuals with low immunity, including identified and unidentified COVID-19 patients. Nurses may accidentally fail to take proper precautions to prevent the spread of the disease from one patient to another, so they may play a key role in the spread of the virus. The percentage of the population aged 15 to 64 in 2019 (most recent data available) is also considered, since people in this age group are in contact with more people than children and retired individuals are, due to college, university, work, socializing opportunities, etc.

Other World Bank indicators were available, but we decided not to use them. For some, the most recent data available are from 2013 or earlier (proportion of population spending [8], etc.); some have more to do with the number of deaths than with the spread of the disease (percentage of deaths in a year caused by communicable diseases [9], etc.); and for others we could not find a logical way to relate them to the increase in the number of infected individuals (hospital beds [10], etc.).

After organizing the data and removing countries with missing values, 78 countries remain in the dataset, out of the 195 countries in the world (approximately 40%) [11]. That is about 46.7% of the 167 countries included in Project 2.

Table 1. Sample of 5 randomly chosen countries from the data set used in this study
Country cumulative_confirmed_cases_per_10000 Stringency_Index Economic_Support_Index Economic_Support_Index_levels
Lithuania 32.667778 52.78 75 [75,87.5)
Namibia 50.113649 62.96 75 [75,87.5)
Brazil 253.668230 77.31 50 [50,62.5)
Timor-Leste 0.224264 33.33 100 100
Lebanon 99.886037 74.07 25 [25,37.5)
Country Population2019 age15_64_population_prop_2019 nurses_midwives_per_1000_2018 Smoking_prevalence_15_2016
Lithuania 2786844 64.70614 9.8472 28.8
Namibia 2494530 59.48604 1.9540 21.4
Brazil 211049527 69.73920 10.1190 13.9
Timor-Leste 1293119 58.42138 1.6680 42.6
Lebanon 6855713 67.15449 1.6735 33.8

II. Exploratory data analysis


Table 2: Summary for the cumulative confirmed cases per 10,000
n min median mean max sd
78 0.0334753 27.50465 67.20095 461.5392 92.11358

Our total sample size is 78 (Table 2). The mean cumulative confirmed cases (CCC) per 10,000 is about 67.20, far greater than the median of 27.50, indicating that the CCC distribution is heavily right-skewed, as can be seen in Figure 1. This is to be expected, since the lowest possible CCC is 0 whereas there is no such bound on the highest value. Most countries have a CCC below the 300-mark, and there are some very extreme outliers around the 450-mark.

Figure 1. Distribution for the cumulative confirmed cases per 10,000 for individual countries

The distribution of the Stringency Index (Figure 2), which measures government response, resembles a bell shape, although with a slight left skew. The Economic Support Index distribution (Figure 3), which records measures such as income support and debt relief, also seems slightly left-skewed. It looks bimodal, with modes at 50 and 75, but we suspect that could be due to rounding. In Figure 4, the proportion of the population aged 15 to 64 seems slightly left-skewed, with 2 outliers around the 85-mark that do not seem too extreme or influential. Figure 5 shows an extremely right-skewed distribution of nurses and midwives per 1,000. Finally, smoking prevalence for 15+ year olds (Figure 6) looks reasonably normally distributed.

Figure 2. Distribution for the government response measured by the Stringency Index

Figure 3. Distribution for the government response measured by the Economic Support Index

Figure 4. Distribution for the Proportion of population that is 15-64 years old, in 2019 for individual countries

Figure 5. Distribution for nurses and midwives per 1000 in 2018 for individual countries

Figure 6. Distribution for the Smoking Prevalence for 15+ years olds, in 2016 for individual countries

In Figure 7.1, the scatterplot shows that there seems to be some correlation between the cumulative confirmed cases per 10,000 (CCC) and the Stringency Index, which suggests that, without implying any causal effect, countries with a higher number of cases per 10,000 also tend to have stricter pandemic response policies. It is worth noting that there are outliers (in particular the one above the 400-mark of CCC) that might have more influence on the best fit line. We also included a Loess curve, which suggests an upward trend followed by a drop (although the curve may be overfitting). For the many countries with a CCC of (almost) 0, the Stringency Index varies the most (from 0 to 100) compared to other CCC levels, with more points clustering in the [50, 75] range. The same spread is seen for the Economic Support Index, which suggests that countries with very low CCC spend widely varying amounts on income support and debt relief packages. Countries with higher CCC, however, tend to spend more on such packages.

Figure 7.1. Interactive Scatterplot for the cumulative confirmed cases per 10,000 for individual countries against their government response measured by the Stringency Index. The red line is the best fit line. The blue curve is the Loess curve. The vertical black lines indicate the chosen knot locations at the 5th percentile SI = 36.1100, 35th percentile SI = 61.0640, 65th percentile SI= 77.3335, 95th percentile SI= 92.7295

The scatterplot in Figure 7.2 of CCC against the Stringency Index, grouped by Economic Support Index (ESI) level, shows drastically different slopes for each ESI interval. This suggests complex behavior in the data, which can be better observed in Figure 12.

Figure 7.2. Interactive Scatterplot for the cumulative confirmed cases per 10,000 for individual countries against their government response measured by the Stringency Index, grouped by the Economic Support Index levels

The scatterplot in Figure 8 of CCC against the Economic Support Index has more points at the bottom and fewer at the top. This suggests that countries with fewer cases per 10,000 individuals tend to spend less on economic relief packages.

Figure 8. Interactive Scatterplot for the cumulative confirmed cases per 10,000 for individual countries against their government response measured by the Economic Support Index. The red line is the best fit line. The blue curve is the Loess curve.

Figures 9, 10, and 11 show the scatterplots of CCC against the proportion of the population that is 15-64 years old, the smoking prevalence of 15+ year olds, and nurses and midwives per 1,000, respectively. All except smoking prevalence show some relationship between CCC and the respective predictor.

Figure 9. Interactive Scatterplot for the cumulative confirmed cases per 10,000 for individual countries against their Proportion of population that is 15-64 years old, in 2019. The red line is the best fit line. The blue curve is the Loess curve. The vertical black lines indicate the chosen knot locations at the 5th percentile APP = 53.54730, 35th percentile APP = 62.88236, 65th percentile APP = 65.88236, 95th percentile APP = 72.63432.

Figure 10. Interactive Scatterplot for the cumulative confirmed cases per 10,000 for individual countries against their Smoking prevalence for 15+ year olds in 2016. The red line is the best fit line. The blue curve is the Loess curve.

Figure 11. Interactive Scatterplot for the cumulative confirmed cases per 10,000 for individual countries against their nurses and midwives per 1000 in 2018. The red line is the best fit line. The blue curve is the Loess curve. The vertical black lines indicate the chosen knot locations at the 5th percentile NM = 0.434670, 35th percentile NM = 1.654765, 65th percentile NM = 4.196680, 95th percentile NM = 12.579690.

Figure 12. Boxplot of the cumulative confirmed cases per 10,000 for individual countries, by Economic Support Index level


III. Multiple linear regression

i. Methods


The exploratory analysis above suggested that the slope changes sharply across intervals of the Economic Support Index, so a linear model might not be the best choice for capturing this complex behavior of the data. We therefore decided to use a natural spline model. For each splined predictor we placed four knots at the 5th, 35th, 65th, and 95th percentiles, following the density of the data so that the three interior intervals contain a similar number of points.
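As a rough illustration of percentile-based knot placement, the sketch below uses a simulated predictor `x` (an assumption, not the study's data) and R's default quantile definition, which may differ slightly from whatever was used to obtain the reported knots.

```r
set.seed(1)
x <- runif(78, min = 0, max = 100)  # simulated stand-in for a predictor

# Four knots at the 5th, 35th, 65th, and 95th percentiles.
knots <- quantile(x, probs = c(0.05, 0.35, 0.65, 0.95))

# Count observations in each of the five intervals the knots create.
intervals <- cut(x, breaks = c(-Inf, knots, Inf))
counts <- table(intervals)
print(knots)
print(counts)
```

With this placement the three interior intervals each hold roughly 30% of the observations, while the two outer intervals hold about 5% each.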

Since the exploratory analysis shows that the distribution of our Y variable is extremely right-skewed and has many influential outliers, we decided to transform the response to mitigate this problem. We also recognize the danger of overfitting, so unlike in the last project, we do not use Box-Cox to optimize the transformation for this data set, but rather use a more “natural” transformation: the square root.

Figure 13. Distribution for the cumulative confirmed cases per 10,000 raised to 0.5, for individual countries

Using the following model:

## lm(formula = cumulative_confirmed_cases_per_10000_transf ~ ns(Stringency_Index, 
##     knots = c(36.11, 61.064, 77.3335, 92.7295)) + Economic_Support_Index + 
##     ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 
##         65.88236, 72.63432)) + ns(nurses_midwives_per_1000_2018, 
##     knots = c(0.43467, 1.654765, 4.19668, 12.57969)) + Smoking_prevalence_15_2016, 
##     data = tidy_joined_dataset)
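A reduced version of this kind of model can be sketched on simulated data as follows. The data frame `sim`, its column names, and the simulated values are assumptions made for illustration; only `lm()`, `ns()` from the `splines` package, and `sqrt()` mirror the call shown above.

```r
library(splines)

set.seed(42)
n <- 78
# Simulated stand-in data; not the study's dataset.
sim <- data.frame(
  Stringency_Index = runif(n, 0, 100),
  Economic_Support_Index = sample(c(0, 25, 50, 75, 100), n, replace = TRUE),
  ccc_per_10000 = rexp(n, rate = 1 / 60)
)
sim$ccc_sqrt <- sqrt(sim$ccc_per_10000)  # square-root transformation

# Four interior knots at the 5th, 35th, 65th, and 95th percentiles.
si_knots <- unname(quantile(sim$Stringency_Index, c(0.05, 0.35, 0.65, 0.95)))

fit <- lm(ccc_sqrt ~ ns(Stringency_Index, knots = si_knots) +
            Economic_Support_Index, data = sim)

# Four interior knots give five natural-spline basis columns.
n_basis <- ncol(ns(sim$Stringency_Index, knots = si_knots))
print(n_basis)
print(length(coef(fit)))
```

This is why each splined predictor in the model uses 5 degrees of freedom, while each linear term uses 1.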

From Figures 14 to 21, we observe that, though not perfect, the diagnostics are more promising: the distribution of the residuals is more bell-shaped, the normal Q-Q plot shows an almost straight line, and the residual scatterplots (Figures 16 to 21) are cloud-shaped. We conclude that the transformation allows the model assumptions to be reasonably met, so we proceed with the analysis.

Figure 14. Normal Q-Q plot for the cumulative number of confirmed cases per 10,000, raised to 0.5

Figure 15. Residuals distribution for the statistical model

Figure 16. Residuals graph for the fitted values, with a Lowess curve in blue and a horizontal line at zero in red.

Figure 17. Residuals graph for the Stringency Index, with a Lowess curve in blue and a horizontal line at zero in red.

Figure 18. Residuals graph for the Economic Support Index, with a Lowess curve in blue and a horizontal line at zero in red.

Figure 19. Residuals graph for the Proportion of population that is 15-64 years old, in 2019, with a Lowess curve in blue and a horizontal line at zero in red.

Figure 20. Residuals graph for the Smoking prevalence for 15+ year olds in 2016, with a Lowess curve in blue and a horizontal line at zero in red.

Figure 21. Residuals graph for the nurses and midwives per 1000 in 2018, with a Lowess curve in blue and a horizontal line at zero in red.

In Table 3, the GVIF values for the variables with 1 degree of freedom and the GVIF^(1/(2*Df)) values for the variables with more than 1 degree of freedom are all between 1 and 5. This indicates at most moderate correlation among the predictor variables. Since there is little multicollinearity, the statistical power of the model is not greatly reduced.

Table 3: VIF table
GVIF Df GVIF^(1/(2*Df))
ns(Stringency_Index, knots = c(36.11, 61.064, 77.3335, 92.7295)) 3.249592 5 1.125079
Economic_Support_Index 1.349972 1 1.161883
ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 65.88236, 72.63432)) 3.108132 5 1.120082
ns(nurses_midwives_per_1000_2018, knots = c(0.43467, 1.654765, 4.19668, 12.57969)) 5.211657 5 1.179499
Smoking_prevalence_15_2016 1.440917 1 1.200382
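The third column of Table 3 can be reproduced from the first two with plain arithmetic. The sketch below only re-enters the GVIF and Df values printed in the table; the short term labels are shorthand introduced here.

```r
# GVIF and Df values copied from Table 3 (labels are shorthand).
gvif <- c(Stringency = 3.249592, Economic = 1.349972, Age15_64 = 3.108132,
          Nurses = 5.211657, Smoking = 1.440917)
dof <- c(5, 1, 5, 5, 1)

# Degrees-of-freedom adjustment that makes terms with different Df comparable.
adjusted <- gvif^(1 / (2 * dof))
print(round(adjusted, 6))
```

For terms with 1 degree of freedom the adjusted value is simply the square root of the GVIF.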

ii. Model Results and Interpretation

## lm(formula = cumulative_confirmed_cases_per_10000_transf ~ ns(Stringency_Index, 
##     knots = c(36.11, 61.064, 77.3335, 92.7295)) + Economic_Support_Index + 
##     ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 
##         65.88236, 72.63432)) + ns(nurses_midwives_per_1000_2018, 
##     knots = c(0.43467, 1.654765, 4.19668, 12.57969)) + Smoking_prevalence_15_2016, 
##     data = tidy_joined_dataset)

Given the nature of splines, the individual model coefficients cannot be interpreted directly: the basis terms of a splined predictor change together, so a ceteris paribus (all else unchanged) reading of a single coefficient is not possible when predicting the average square root of the cumulative cases per 10,000. Instead, we assess each predictor as a whole through the omnibus tests discussed with the ANOVA table below.

The coefficient p-values in Table 4 do show, however, that the first 4 basis terms of the Stringency Index spline, the 5th basis term of the age 15-64 population proportion spline, and basis terms 3, 4, and 5 of the nurses and midwives per 1,000 spline have p-values < 0.05, which suggests they are helpful for predicting the average square root of the cumulative cases per 10,000.

In contrast, the Economic Support Index and smoking prevalence are not significant, with p-values > 0.05.


Table 4. Model Summary Table
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.4612 4.7789 -2.1890 0.0325
ns(Stringency_Index, knots = c(36.11, 61.064, 77.3335, 92.7295))1 10.1282 3.6028 2.8112 0.0067
ns(Stringency_Index, knots = c(36.11, 61.064, 77.3335, 92.7295))2 9.3930 4.0014 2.3474 0.0222
ns(Stringency_Index, knots = c(36.11, 61.064, 77.3335, 92.7295))3 17.0716 3.5971 4.7459 0.0000
ns(Stringency_Index, knots = c(36.11, 61.064, 77.3335, 92.7295))4 16.2272 7.7118 2.1042 0.0396
ns(Stringency_Index, knots = c(36.11, 61.064, 77.3335, 92.7295))5 0.7466 3.0984 0.2410 0.8104
Economic_Support_Index 0.0257 0.0175 1.4694 0.1470
ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 65.88236, 72.63432))1 1.1669 3.1130 0.3748 0.7091
ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 65.88236, 72.63432))2 6.4597 3.4426 1.8764 0.0655
ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 65.88236, 72.63432))3 2.8432 3.7089 0.7666 0.4463
ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 65.88236, 72.63432))4 11.7261 6.7166 1.7458 0.0860
ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 65.88236, 72.63432))5 12.4694 3.2654 3.8186 0.0003
ns(nurses_midwives_per_1000_2018, knots = c(0.43467, 1.654765, 4.19668, 12.57969))1 1.7848 2.2335 0.7991 0.4274
ns(nurses_midwives_per_1000_2018, knots = c(0.43467, 1.654765, 4.19668, 12.57969))2 -0.0042 3.4184 -0.0012 0.9990
ns(nurses_midwives_per_1000_2018, knots = c(0.43467, 1.654765, 4.19668, 12.57969))3 9.8196 3.7608 2.6110 0.0114
ns(nurses_midwives_per_1000_2018, knots = c(0.43467, 1.654765, 4.19668, 12.57969))4 8.7379 4.2247 2.0683 0.0429
ns(nurses_midwives_per_1000_2018, knots = c(0.43467, 1.654765, 4.19668, 12.57969))5 8.2892 2.9914 2.7710 0.0074
Smoking_prevalence_15_2016 -0.0427 0.0498 -0.8574 0.3946
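Since individual spline coefficients are hard to read, one way to test a splined predictor as a whole is to compare nested models with an F test. The sketch below uses simulated data and hypothetical variable names (`x`, `z`, `y`), so it shows the mechanics rather than the study's results.

```r
library(splines)

set.seed(7)
n <- 78
# Simulated stand-in data; not the study's dataset.
sim <- data.frame(x = runif(n, 0, 100), z = rnorm(n))
sim$y <- sin(sim$x / 20) + 0.1 * sim$z + rnorm(n, sd = 0.5)

kn <- unname(quantile(sim$x, c(0.05, 0.35, 0.65, 0.95)))

reduced <- lm(y ~ z, data = sim)                     # without the spline
full <- lm(y ~ z + ns(x, knots = kn), data = sim)    # with the spline

# Joint F test of all five spline basis terms at once.
omnibus <- anova(reduced, full)
print(omnibus)
```

The test has 5 numerator degrees of freedom, one per basis term, which matches the Df of the splined predictors in this report.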

With an adjusted R-squared of 0.542 (multiple R-squared of 0.643), the model explains a substantial share of the variability in the square root of the cumulative cases per 10,000. Coupled with the significance of several predictors and the low model p-value of 3.565e-08, this leads us to believe the model is helpful in its explanatory ability.

Value df
Residual Standard Error 3.467 60
Multiple R-squared 0.643
Adjusted R-squared 0.542
Value Numerator df Denominator df
Model F-statistic 6.363 17 60
P-value 3.565e-08
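The summary statistics above are linked by standard formulas, which the following sketch checks using only the rounded values reported here (n = 78 countries, 17 model terms, 60 residual degrees of freedom). Small discrepancies are expected because the inputs are rounded.

```r
r2 <- 0.643   # multiple R-squared as reported
n <- 78       # number of countries
p <- 17       # number of model terms excluding the intercept

# Adjusted R-squared penalises the number of terms.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic implied by R-squared.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail p-value for the reported F statistic on 17 and 60 df.
p_value <- pf(6.363, df1 = 17, df2 = 60, lower.tail = FALSE)

print(c(adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value))
```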

iii. Inference for multiple regression

Interpretation of the ANOVA table result from table 5:

The Stringency Index spline (knots at 36.11, 61.064, 77.3335, and 92.7295), with 5 degrees of freedom, adds a sum of squares of 390.6686 (mean square 78.1337). With an F value = 6.4999 and p-value = 0.0001, we can conclude that the Stringency Index alone in the model explains a significant amount of variability.

The Economic Support Index, with 1 degree of freedom, adds a sum of squares of 126.2974. With an F value = 10.5065 and p-value = 0.0019, we can conclude that the Economic Support Index, given that the Stringency Index spline is in the model, is statistically significant.

The proportion of the population that is 15-64 years old in 2019 (spline with knots at 53.5473, 62.88236, 65.88236, and 72.63432), with 5 degrees of freedom, adds a sum of squares of 531.3317 (mean square 106.2663). With an F value = 8.8402 and p-value < 0.0001, we can conclude that the proportion of the population that is 15-64 years old, given that the Stringency Index spline and the Economic Support Index are in the model, is statistically significant.

The nurses and midwives per 1,000 in 2018 (spline with knots at 0.43467, 1.654765, 4.19668, and 12.57969), with 5 degrees of freedom, adds a sum of squares of 243.1708 (mean square 48.6342). With an F value = 4.0458 and p-value = 0.0031, we can conclude that the nurses and midwives per 1,000, given that the Stringency Index spline, the Economic Support Index, and the age 15-64 population proportion spline are in the model, is statistically significant.

The smoking prevalence for 15+ year olds in 2016, with 1 degree of freedom, adds a sum of squares of 8.8368. With an F value = 0.7351 and p-value = 0.3946, we can conclude that smoking prevalence, given that the Stringency Index spline, the Economic Support Index, the age 15-64 population proportion spline, and the nurses and midwives per 1,000 spline are in the model, is not statistically significant at a significance level of 0.05.

Table 5. ANOVA Table
Df Sum Sq Mean Sq F value Pr(>F)
ns(Stringency_Index, knots = c(36.11, 61.064, 77.3335, 92.7295)) 5 390.6686 78.1337 6.4999 0.0001
Economic_Support_Index 1 126.2974 126.2974 10.5065 0.0019
ns(age15_64_population_prop_2019, knots = c(53.5473, 62.88236, 65.88236, 72.63432)) 5 531.3317 106.2663 8.8402 0.0000
ns(nurses_midwives_per_1000_2018, knots = c(0.43467, 1.654765, 4.19668, 12.57969)) 5 243.1708 48.6342 4.0458 0.0031
Smoking_prevalence_15_2016 1 8.8368 8.8368 0.7351 0.3946
Residuals 60 721.2501 12.0208 NA NA
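The entries of Table 5 are tied together by simple arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and each F value is that mean square divided by the residual mean square. The sketch below re-enters the Df and Sum Sq columns from the table (with shortened term labels) and recomputes the rest.

```r
# Df and Sum Sq copied from Table 5 (term labels are shorthand).
anova_tab <- data.frame(
  term = c("Stringency (ns)", "Economic Support", "Age 15-64 (ns)",
           "Nurses (ns)", "Smoking"),
  df = c(5, 1, 5, 5, 1),
  sum_sq = c(390.6686, 126.2974, 531.3317, 243.1708, 8.8368)
)
resid_ss <- 721.2501
resid_df <- 60
mse <- resid_ss / resid_df  # residual mean square

anova_tab$mean_sq <- anova_tab$sum_sq / anova_tab$df
anova_tab$f_value <- anova_tab$mean_sq / mse
anova_tab$p_value <- pf(anova_tab$f_value, anova_tab$df, resid_df,
                        lower.tail = FALSE)

# Multiple R-squared implied by the sums of squares.
r_squared <- sum(anova_tab$sum_sq) / (sum(anova_tab$sum_sq) + resid_ss)

print(anova_tab)
print(r_squared)
```

Because these are sequential sums of squares, each row is conditional on the terms listed above it, so the order of terms in the model formula matters.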

Figure 22 shows the spline fit with its 95% confidence interval (CI) and the wider pink 95% prediction interval (PI) for the transformed cumulative confirmed cases per 10,000 individuals (raised to the power of 0.5) against the government response measured by the Stringency Index. The other predictor variables are set equal to their medians: median economic support index = 50, median population proportion of ages 15 to 64 in 2019 = 64.65951, median nurses midwives per 1000 in 2018 = 2.47285, and median Smoking prevalence for people ages 15+ in 2016 = 17.05. The knots chosen are shown by the black vertical lines at the 5th percentile SI = 36.1100, 35th percentile SI = 61.0640, 65th percentile SI = 77.3335, and 95th percentile SI = 92.7295. For Stringency Index values from 20 to 80 and from 95 to 100, the 95% PI includes 0, so the predicted transformed CCC per 10,000 could be 0. The majority of the countries have a Stringency Index between 50 and 95. Over this range, the 95% PI is fairly constant in width.

With the predictors other than the Stringency Index held constant at the aforementioned values, the 95% confidence intervals in the 1st interval (below the first knot) are very wide and extend into negative values of the cumulative confirmed cases per 10,000 (raised to 0.5), which indicates that the fitted spline is not reliable before the first knot. The confidence intervals narrow toward the 3rd and 4th intervals, showing relatively better precision there than elsewhere.

Figure 22. Interactive Scatterplot for the cumulative confirmed cases per 10,000 (raised to 0.5) for individual countries against their government response measured by the Stringency Index, where median economic support index = 50, median population proportion of ages 15 to 64 in 2019 = 64.65951, median nurses midwives per 1000 in 2018 = 2.47285, and median Smoking prevalence for people ages 15+ in 2016 = 17.05. The blue line is the spline, with its associated 95% CI and wider pink 95% PI. The vertical black lines indicate the chosen knot locations at the 5th percentile SI = 36.1100, 35th percentile SI = 61.0640, 65th percentile SI= 77.3335, 95th percentile SI= 92.7295.
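Confidence and prediction bands of the kind shown in Figure 22 can be obtained from `predict()` on a grid of one predictor while the others are fixed at their medians. The sketch below uses simulated data and hypothetical column names (`si`, `esi`, `y_sqrt`), so its numbers are not the study's.

```r
library(splines)

set.seed(3)
n <- 78
# Simulated stand-in data; not the study's dataset.
sim <- data.frame(
  si = runif(n, 0, 100),
  esi = sample(c(0, 25, 50, 75, 100), n, replace = TRUE)
)
sim$y_sqrt <- sqrt(rexp(n, rate = 1 / 60))

kn <- unname(quantile(sim$si, c(0.05, 0.35, 0.65, 0.95)))
fit <- lm(y_sqrt ~ ns(si, knots = kn) + esi, data = sim)

# Vary one predictor over its observed range; fix the other at its median.
grid <- data.frame(
  si = seq(min(sim$si), max(sim$si), length.out = 21),
  esi = median(sim$esi)
)

conf_int <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

print(head(cbind(grid, pred_int)))
```

The prediction interval is always at least as wide as the confidence interval, because it also accounts for the residual variation of a single new country.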

Figure 23 shows the spline fit with its 95% confidence interval (CI) and the wider pink 95% prediction interval (PI) for the transformed cumulative confirmed cases per 10,000 individuals (raised to the power of 0.5) against the nurses and midwives per 1000 in 2018. The other predictor variables are set equal to their medians: median Stringency index = 70.14, median economic support index = 50, median population proportion of ages 15 to 64 in 2019 = 64.65951, and median Smoking prevalence for people ages 15+ in 2016 = 17.05. The knots chosen are shown by the black vertical lines at the 5th percentile NM = 0.434670, 35th percentile NM = 1.654765, 65th percentile NM = 4.196680, and 95th percentile NM = 12.579690. The majority of countries have between 0 and 5 nurses and midwives per 1000. In this range, the 95% PI includes 0, so the predicted transformed CCC (raised to the power of 0.5) could be 0, and the 95% PI is fairly constant in width. On the other hand, around the 4th knot at the 95th percentile NM = 12.579690, the 95% PI increases in width toward the end of the predictor range.

With the predictors other than nurses and midwives held constant at the aforementioned values, the 95% confidence intervals are very wide, especially in the 1st, 2nd, and 5th intervals. The 1st interval covers only 5 percentiles of the data and the 2nd has wide intervals, which points to imprecise estimates for nurses and midwives in that segment. The intervals are narrower in the 3rd and 4th intervals, along the range of 5 to 13 cumulative confirmed cases per 10,000 (raised to 0.5).

Figure 23. Interactive Scatterplot for the cumulative confirmed cases per 10,000 (raised to 0.5) for individual countries against their nurses and midwives per 1000 in 2018, where median Stringency index = 70.14, median economic support index = 50, median population proportion of ages 15 to 64 in 2019 = 64.65951, and median Smoking prevalence for people ages 15+ in 2016 = 17.05. The blue line is the spline, with its associated 95% CI and wider pink 95% PI. The vertical black lines indicate the chosen knot locations at the 5th percentile NM = 0.434670, 35th percentile NM = 1.654765, 65th percentile NM = 4.196680, 95th percentile NM = 12.579690.

Figure 24 shows the spline fit with its 95% confidence interval (CI) and the wider pink 95% prediction interval (PI) for the transformed cumulative confirmed cases per 10,000 individuals (raised to the power of 0.5) against the proportion of the population that is 15-64 years old in 2019. The other predictor variables are set equal to their medians: median Stringency index = 70.14, median economic support index = 50, median nurses midwives per 1000 in 2018 = 2.47285, and median Smoking prevalence for people ages 15+ in 2016 = 17.05. The knots chosen are shown by the black vertical lines at the 5th percentile APP = 53.54730, 35th percentile APP = 62.88236, 65th percentile APP = 65.88236, and 95th percentile APP = 72.63432. The majority of the countries lie between the first knot at the 5th percentile and the 4th knot at the 95th percentile. In the 0 to 75 range of the proportion of the population that is 15-64 years old, the 95% PI includes 0, so the predicted transformed CCC (raised to the power of 0.5) could be 0.

With the predictors other than the proportion of the population aged 15-64 held constant at the aforementioned values, the 95% confidence intervals are tightest in the 3rd interval and widen toward the ends.

Figure 24. Interactive Scatterplot for the cumulative confirmed cases per 10,000 (raised to 0.5) for individual countries against their Proportion of population that is 15-64 years old, in 2019, where median Stringency index = 70.14, median economic support index = 50, median nurses midwives per 1000 in 2018 = 2.47285, and median Smoking prevalence for people ages 15+ in 2016 = 17.05. The blue line is the spline, with its associated 95% CI and wider pink 95% PI. The vertical black lines indicate the chosen knot locations at the 5th percentile APP = 53.54730, 35th percentile APP = 62.88236, 65th percentile APP = 65.88236, 95th percentile APP = 72.63432.

Figure 25 shows the fitted line with its 95% confidence interval (CI) and the wider pink 95% prediction interval (PI) for the transformed cumulative confirmed cases per 10,000 individuals (raised to the power of 0.5) against the smoking prevalence for 15+ year olds in 2016. The other predictor variables are set equal to their medians: median Stringency index = 70.14, median economic support index = 50, median population proportion of ages 15 to 64 in 2019 = 64.65951, and median nurses midwives per 1000 in 2018 = 2.47285. The countries are spread out over smoking prevalence values from 0 to 40. Across this whole range, the 95% PI includes 0, so the predicted transformed CCC per 10,000 could be 0.

With predictor variable values other than Smoking prevalence for 15+ year olds held constant at the aforementioned values, our 95% confidence intervals are mostly constant in their width but slightly wider at the ends.

Figure 25. Interactive Scatterplot for the cumulative confirmed cases per 10,000 (raised to 0.5) for individual countries against their Smoking prevalence for 15+ year olds in 2016, where median Stringency index = 70.14, median economic support index = 50, median population proportion of ages 15 to 64 in 2019 = 64.65951, and median nurses midwives per 1000 in 2018 = 2.47285. The blue line is the spline, with its associated 95% CI and wider pink 95% PI.

Figure 26 shows the fitted line with its 95% confidence interval (CI) and the wider pink 95% prediction interval (PI) for the transformed cumulative confirmed cases per 10,000 individuals (raised to the power of 0.5) against the government response measured by the Economic Support Index. The other predictor variables are set equal to their medians: median Stringency index = 70.14, median population proportion of ages 15 to 64 in 2019 = 64.65951, median nurses midwives per 1000 in 2018 = 2.47285, and median Smoking prevalence for people ages 15+ in 2016 = 17.05. The countries are spread out over Economic Support Index values from 0 to 100. Across this whole range, the 95% PI includes 0, so the predicted transformed CCC per 10,000 could be 0.

With predictor variable values other than the Economic Support Index held constant at the aforementioned values, our 95% confidence intervals are mostly constant in their width but slightly wider at the ends.

Figure 26. Interactive Scatterplot for the cumulative confirmed cases per 10,000 (raised to 0.5) for individual countries against their government response measured by the Economic Support Index, where median Stringency index = 70.14, median population proportion of ages 15 to 64 in 2019 = 64.65951, median nurses midwives per 1000 in 2018 = 2.47285, and median Smoking prevalence for people ages 15+ in 2016 = 17.05. The blue line is the spline, with its associated 95% CI and wider pink 95% PI.

Table 6 gives the 95% prediction intervals for the predicted transformed CCC per 10,000 individuals. For example, for a country with a Stringency Index of 30 and the other predictor variables set equal to their medians (Economic Support Index = 50, population proportion of ages 15 to 64 in 2019 = 64.65951, nurses and midwives per 1,000 in 2018 = 2.47285, and smoking prevalence for people ages 15+ in 2016 = 17.05), the transformed cumulative cases per 10,000 are predicted to be between -8.48049 and 7.43092.

The intervals for the other Stringency Index values in Table 6 (50, 70.14, and 90) are read in the same way: for any country with one of these Stringency Index values (and the other predictor variables set equal to their medians), the transformed cumulative cases per 10,000 are predicted to lie between the corresponding lower and upper limits in Table 6.

Table 6. The 95% prediction intervals for the $\text{cumulative confirmed cases per 10,000}^{0.5}$, where Stringency Index (SI) = 30, 50, 70.14, 90, respectively, for median economic support index = 50, median population proportion of ages 15 to 64 in 2019 = 64.65951, median nurses midwives per 1000 in 2018 = 2.47285, and median Smoking prevalence for people ages 15+ in 2016 = 17.05.
| SI | Point Estimate | Lower Limit | Upper Limit |
|---|---|---|---|
| 30.00 | -0.52479 | -8.48049 | 7.43092 |
| 50.00 | 4.94393 | -2.50340 | 12.39126 |
| 70.14 | 5.74856 | -1.50817 | 13.00530 |
| 90.00 | 10.64518 | 3.13071 | 18.15964 |
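The limits in Table 6 are on the square-root scale, so they can be hard to read as case counts. The sketch below shows one way they might be mapped back to cumulative confirmed cases per 10,000 by squaring. It rests on two assumptions that are not part of the original analysis: negative values are truncated at 0 before squaring (a square root cannot be negative), and squaring the point estimate is treated as a rough central value rather than an unbiased mean. The function names are hypothetical.

```python
# Rows of Table 6: (Stringency Index, point estimate, lower limit, upper limit),
# all on the square-root scale, copied from the table.
TABLE_6 = [
    (30.00, -0.52479, -8.48049, 7.43092),
    (50.00, 4.94393, -2.50340, 12.39126),
    (70.14, 5.74856, -1.50817, 13.00530),
    (90.00, 10.64518, 3.13071, 18.15964),
]


def back_transform(value: float) -> float:
    """Map a square-root-scale value back to cases per 10,000.

    Assumption (not from the report): negative values are truncated at 0
    before squaring, because a square root cannot be negative.
    """
    return max(value, 0.0) ** 2


def back_transform_table(rows):
    """Return (SI, estimate, lower, upper) tuples on the original scale."""
    return [
        (si, back_transform(est), back_transform(lo), back_transform(hi))
        for si, est, lo, hi in rows
    ]


if __name__ == "__main__":
    for si, est, lo, hi in back_transform_table(TABLE_6):
        print(f"SI = {si:5.2f}: estimate {est:7.2f}, 95% PI [{lo:6.2f}, {hi:7.2f}]")
```

Under these assumptions, the interval for a Stringency Index of 30 would run from 0 to roughly 55 cumulative confirmed cases per 10,000. Because squaring is monotone on non-negative values, the back-transformed limits keep their 95% coverage interpretation wherever the original limits are non-negative.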

IV. Discussion

i. Conclusions

We recognize that interpretability must sometimes be traded for a better model. Our analysis suggests that the proposed model is useful, as it explains 54.74% of the variability in the cumulative confirmed cases of COVID-19 per 10,000 individuals.

We see evidence to suggest that CCC is positively correlated with the Stringency and Economic Support indices, which aligns with our expectation: it is reasonable for a government to respond more strictly and to spend more of its budget on income support packages if its people are more affected by the pandemic. CCC is also positively correlated with the number of nurses and midwives and with the proportion of 15-64-year-olds in the population, which matches the expectation explained in the introduction.

ii. Limitations

This project is limited by the data available. Adding the three World Bank indicators reduced the number of countries represented in Project 2 by about 53.3%, because countries with a missing value in any of the variables used were excluded. Additionally, there were some notable outliers and high-leverage points that could not be removed, since they are genuine observations rather than mistakes, so their effects on the model remain.

The choice of a non-linear model made the relationship between the variables more complex and less straightforward to interpret, which is acceptable when the added complexity is justified. However, no test was done to check for overfitting, so whether the complexity of the model is appropriate cannot be determined.
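One common way to check for overfitting is k-fold cross-validation, which compares the in-sample error with the error on held-out observations. The sketch below is illustrative only and was not run as part of this project: it uses synthetic data, a polynomial fit stands in for the spline model, and the function names and parameters are hypothetical.

```python
import numpy as np


def k_fold_rmse(x, y, degree, k=5, seed=0):
    """Average out-of-fold RMSE of a polynomial fit of the given degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Shuffle the row indices and split them into k roughly equal folds.
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit on the training folds only, then score on the held-out fold.
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        resid = y[test_idx] - np.polyval(coefs, x[test_idx])
        errors.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(errors))


def in_sample_rmse(x, y, degree):
    """RMSE of a polynomial fit evaluated on the data it was fitted to."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coefs = np.polyfit(x, y, degree)
    return float(np.sqrt(np.mean((y - np.polyval(coefs, x)) ** 2)))


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration, not the study data).
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, size=80)
    y = 2 + 3 * x + rng.normal(0, 0.5, size=80)
    for degree in (1, 3, 6):
        print(
            f"degree {degree}: in-sample RMSE = {in_sample_rmse(x, y, degree):.3f}, "
            f"5-fold RMSE = {k_fold_rmse(x, y, degree):.3f}"
        )
```

A cross-validated error that is much larger than the in-sample error, or that grows as the model becomes more flexible, would point towards overfitting. With the small number of countries left after removing missing values, such estimates would themselves be noisy.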

iii. Further questions

The relationship studied in this report cannot be assumed to generalize to other sets of dates, since no test was done to check its generalizability. A sensitivity analysis could also be carried out to check the effect of the outliers on the model. Another study could pursue the same aim as this one using methods other than regression analysis.


V. Citations and References


  1. “Total Population.” World Bank Indicators of Interest to the COVID-19 Outbreak. COVID-19 Pandemic. _World Bank_. United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed October 2020. https://data.humdata.org/dataset/world-bank-indicators-of-interest-to-the-covid-19-outbreak

  2. “time_series_covid19_confirmed_global.csv.” Novel Coronavirus (COVID-19) Cases Data. COVID-19 Pandemic. Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed October 2020. https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

  3. “OxCGRT_CSV.” Oxford COVID-19 Government Response Stringency Index. COVID-19 Pandemic. The Oxford COVID-19 Government Response Tracker. United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed October 2020. https://data.humdata.org/dataset/oxford-covid-19-government-response-tracker

  4. “Health - Smoking prevalence, total, ages 15+.” World Bank Indicators of Interest to the COVID-19 Outbreak. COVID-19 Pandemic. _World Bank_. United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed November 2020. https://data.humdata.org/dataset/world-bank-indicators-of-interest-to-the-covid-19-outbreak

  5. “Health - Nurses and midwives (per 1,000 people).” World Bank Indicators of Interest to the COVID-19 Outbreak. COVID-19 Pandemic. _World Bank_. United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed November 2020. https://data.humdata.org/dataset/world-bank-indicators-of-interest-to-the-covid-19-outbreak

  6. “Age and Population - Population ages 15-64 (% of total population).” World Bank Indicators of Interest to the COVID-19 Outbreak. COVID-19 Pandemic. _World Bank_. United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed November 2020. https://data.humdata.org/dataset/world-bank-indicators-of-interest-to-the-covid-19-outbreak

  7. “Coronavirus disease (COVID-19): Prevention and risks.” Government of Canada. Accessed November 2020. https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection/prevention-risks.html

  8. “Health - Proportion of population spending.” World Bank Indicators of Interest to the COVID-19 Outbreak. COVID-19 Pandemic. _World Bank_. United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed November 2020. https://data.humdata.org/dataset/world-bank-indicators-of-interest-to-the-covid-19-outbreak

  9. “Health - Cause of Death, by communicable diseases.” World Bank Indicators of Interest to the COVID-19 Outbreak. COVID-19 Pandemic. _World Bank_. United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed November 2020. https://data.humdata.org/dataset/world-bank-indicators-of-interest-to-the-covid-19-outbreak

  10. “Health - Hospital beds per 1,000 people.” World Bank Indicators of Interest to the COVID-19 Outbreak. COVID-19 Pandemic. _World Bank_. United Nations Office for the Coordination of Humanitarian Affairs. 2020. Accessed November 2020. https://data.humdata.org/dataset/world-bank-indicators-of-interest-to-the-covid-19-outbreak

  11. “How many Countries are there in the World?” Worldometer. 2020. Accessed October 2020. https://www.worldometers.info/geography/how-many-countries-are-there-in-the-world/